glPca order of magnitude faster#150
Conversation
merge upstream
|
Sounds great, many thanks for this. It would be great to integrate this to Another reform would be to use .Call to avoid copying the data, but maybe Would you like to PR the integration above? No worries if not, but I'll In any case, thanks again for a welcome tweak! On 14 Aug 2016 21:51, "Libor Mořkovský" notifications@github.com wrote:
|
|
It was kind of a 'weekend project', I don't really have time for a proper refactoring. It would probably make sense to drop some parts of the code - eg. the parallel R implementation - to keep the code simple. If the C code will survive, the NA handling needs some work - probably replace the NA list by a bitmap which can be accessed by one pointer addition and masking or do the NA handling beforehand. Maybe I'd find some time to do bits of it if we agree on particular approach. |
|
Hi there, |
|
The failure is due to a lack of documentation for the new function. |
|
@thibautjombart how relevant is this PR 8 years later? I'm cleaning up my PR list, and would like to get rid of this. Is that change (and the whole package) still relevant and some bits of my low-effort help would be worth it, or noone cares any more and I'll just close the PR? Excuse my silly questions, I've left the field a few years ago.. |
|
I've been the maintainer of this since the pandemic and I do think it's still worthwhile incorporating, especially given these values. If it's alright with you, I can merge it into a separate branch and then work on it from there. I will also add you as an author to the description as "Libor Mořkovský". |
|
@zkamvar sure, do anything you need to get this moving. tag me, should you need any assistance! |
|
closing due to @zkamvar 's plan to move the work on this to another branch |
For 'wide' data sets (~100 individuals, <100k SNPs) - especially with >100k NAs,
glPcagets really slow - minutes to hours. The reason is twofold - it either uses cross product implementation with the 'hot loop' in R, or a C implementation with very suboptimal handling of NAs (linear search for NA positions insnpbin.c::snpbin_isna).This PR introduces a new function
glPcaFast, which cuts the processing time to seconds. It tries to be as compatible as possible with the original. Most of the code is a copy of the original function, only the cross product loop was replaced.It prepares the data as a full
matrix, scaled and centered, with NAs replaced. Memory size of the matrix grows linearly with the size of thegenlightobject (eg for 160 MB vcf, 6 MB genlight it's ~60 MB).glPcaFastthen usesbase::tcrossprodto calculate the result in an efficient way. It's trading memory usage for speed and code simplicity, because 100 MB RAM is close to nothing nowadays.Ideally this would be tested more thoroughly, merged with the original
glPcaand used as a default unless the data set is really huge. But I didn't want to break some other code possibly depending on some particular behavior of the originalglPcaand would prefer to leave it on the author.